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Abstract 

The hybrid linear modeling problem is to identify a set 
of ' d- dimensional affine sets in Wi D . It arises, for example, 
in object tracking and structure from motion. The hybrid 
linear model can be considered as the second simplest (be- 
hind linear) manifold model of data. In this paper we will 
present a very simple geometric method for hybrid linear 
modeling based on selecting a set of local best fit flats that 
minimize a global l\ error measure. The size of the local 
neighborhoods is determined automatically by the Jones' 
P2 numbers; it is proven under certain geometric condi- 
tions that good local neighborhoods exist and are found by 
our method. We also demonstrate how to use this algorithm 
for fast determination of the number of affine subspaces. 
We give extensive experimental evidence demonstrating the 
state of the art accuracy and speed of the algorithm on syn- 
thetic and real hybrid linear data. 

Supp. webpage: http://www.math. umn.edu/Merman/lbf/ 

1. Introduction 

Many data sets can be modeled as unions of affine sub- 
spaces. This Hybrid Linear Modeling (HLM) finds diverse 
applications in many areas, such as motion segmentation 
in computer vision, hybrid linear representation of images, 
classification of face images, and temporal segmentation of 
video sequences (see e.g., dE)). 

Several algorithms have been suggested for solving 
this problem, for example the i\"-flats (KF) algorithm 
or any of its variants (3] |4j 13 EJ EJ, Subspace Separa- 
tion [8 9, 10 1, Generalized Principal Component Analysis 
(GPCA) CQ, Local Subspace Affinity (LSA) (Til, Agglom- 
erative Lossy Compression and Spectral Curvature Cluster- 
ing (SCC) [12|. Some algorithms for modeling data by a 
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mixture of more general surfaces have been successfully ap- 
plied to HLM UJ HD . 

In this paper, inspired by ifBl [T6l [HI and ifTBT [T9l 
we will describe an extremely straightforward geometric 
method for hybrid linear modeling that can either be used 
in a stand alone manner or as an initialization of any of the 
above methods. The basic idea is that for a data set sam- 
pled from a hybrid linear model and a random point of it 
x: the principal components of a neighborhood of appro- 
priate size of x often give a good approximation to its near- 
est subspace. An appropriate neighborhood size needs to 
be larger than the noise, so that the affine cluster is rec- 
ognized. However, not too large so that the neighborhood 
intersects multiple clusters. Such neighborhoods (in which 
a subspace is clearly distinguished) always exist for points 
far enough from the intersection of subspaces (i.e., most of 
points), as long as the following two assumptions are satis- 
fied: Samples are sufficiently dense along local regions of 
the subspaces and data points sufficiently far from the inter- 
section of subspaces are mostly surrounded by neighbors of 
the same subspace (this is true when the affine Grassman- 
nian distance between subspaces is sufficiently large and the 
noise level is sufficiently small). 

The contributions of this work are as follows: we make 
precise the local fit heuristic, using the 1% version of Jones' 
(3 numbers lfT5l[T6l[T7l . and state a theorem that tells us un- 
der certain geometric conditions how to calculate the size 
of the optimal local neighborhood. Using this, we intro- 
duce a new algorithm for affine clustering based on the 
above heuristic. At each of a randomly chosen subset of the 
data, we build a candidate flat by calculating the principal 
components of a large neighborhood which still lies in only 
one affine cluster. The algorithm then selects among the 
best fit flats of each of the neighborhoods to build a global 
model using an l\ error energy. We show experimentally 
that this algorithm obtains state of the art accuracies on real 
and synthetic HLM problems while running extremely fast 
(often on the order of ten times faster than most of the pre- 
viously mentioned methods). Note that the two parts of the 
algorithm are independent and can be used with other al- 



gorithms. In particular, we can use the local fit heuristic to 
initialize other HLM algorithms. We will give experimental 
evidence to show that the A'-flats algorithm [7] is improved 
by such initialization. We also show how to use this fast 
algorithm to quickly determine the number of affine sub- 
spaces. 

The rest of this paper is organized as follows. In Sec- 
tion 12 we describe in greater depth the two parts of the 
above algorithm, and state a theorem giving conditions that 
guarantee that good neighborhoods can be found. Section[3] 
carefully tests the algorithm on both artificial data of syn- 
thetic hybrid linear models and real data of motion segmen- 
tation in video sequences. It also demonstrates how to de- 
termine the number of clusters by applying the fast algo- 
rithm of this paper together with the straightforward elbow 
method. Section |4] concludes with a brief discussion and 
mentions possibilities for future work. 

2. Randomized local best fit flats 

The algorithm partitions a data set X = 
{xi,X2,--- ,xn} C R-° into K clusters Xi, . . ., 
Xa', with each cluster approximated by a d-dimensional 
affine subspace, which we refer to as d-flats or flats. We 
sketch it as follows, while suppressing details that appear 
later in Algorithms [3] and [2 

Algorithm 1 HLM by randomized local best fit flats 

Input: X = {xi,X2, • • • ,x n } C M. D : data, d: dimension 
of subspaces, C: number of candidate planes, K: number 
of output flats/clusters (K < C), other parameters used 
by Algorithms [3] and |2] 

Output: A partition of X into K disjoint clusters {Xi}fL x , 
each approximated by a single flat. 
Steps: 

• Pick C random points in X 

• For each of the C points find appropriate local neigh- 
borhoods using Algorithmic] 

• Generate C flats (by PCA) for the C neighborhoods 
of the previous step 

• Choose K flats from the C flats above using Algo- 
rithm |2] 

• Partition X by sending points to nearest K flats above 

The proposed algorithm breaks into two main parts. The 
first part finds a set of candidate flats. It takes as input the 
dimension of the flats to be found and the number of can- 
didates to search for. It starts by randomly selecting one 
point for each candidate flat. The algorithm chooses a scale 
(that is, a number of neighbors) around each of the seed 
points. The best fit flats (in L 2 sense) for each of the cho- 
sen neighborhoods are collected as candidates. The method 



Algorithm 2 Greedy t\ candidate selection for HLM by 

randomized local best fit flats 

Input: X = {xi,X2,--- ,x„} C ~$l D : data, K: number 

of flats, Li,...,Lc'- candidate flats, and p: number of 

passes. 
Output: A set of K "active" flats C C {L\, ..., Lc} . 

Steps: 

Initialize L by randomly choosing K "active" flats 

LAii •••■> La k 

for pass = 1 to p do 

Pick a random flat L Al C C (1 < I < K) 

forj = ltoC-K do 

• Pick one of the "inactive" flats Lj and form the 
collection of flats Cj = Lj{J £\ La l 

• Set Sj = J2i=i mm Le£j W Xi ~ Pi<Xi\\ 
end for 

Ifmirij-Sj < EiIi mill ie{i.A 1 ,...,iA K } IN ~ p LXi\\, 

Set Li Ai ■ — -^argmin s j 

end for 



for choosing the best scale is described in Section [2TTI and 
sketched in Algorithm^ 

The second part of the algorithm searches for a good set 
of flats from the candidates in a greedy fashion. A number 
K of desired flats and a measure of goodness of a K tuple 
of flats G = Gx(Li, ...Lk) is chosen; here, it will be the 
average £i distance of each point to its nearest flat. After 
randomly initializing K flats from the list of candidates, p 
passes are made through the data points. One of the current 
choices of flats is removed, and all the other candidates are 
tried in its place. If G decreases, we replace the current flat 
with the one which gives the lowest value for G. We then 
move to the next pass, picking a random flat, etc. 

The simplest choice of G is the sum of the squared dis- 
tances of each point in X to its nearest flat. In our exper- 
iments, we will use an l\ energy, i.e., summing the dis- 
tance of every point to its chosen flat. We have experi- 
mentally found that this energy is more robust to outliers 
than least squares error (see also |20] for a similar conclu- 
sion with a different implementation of t\ subspace mini- 
mization and [21 1 for partial theoretical justification). One 
can also imagine using spectral distances that measure the 
smoothness of the clusters with respect to some kernel, or 
many other global energy functionals of a partition. The 
nice thing about this method is that it allows for energy 
functionals which may be hard to minimize; since we are 
only testing the energy of our candidate configurations, as 
long as we can compute the energy of a partition quickly, 
we can run the greedy descent. 

2.1. Choosing the optimal neighborhood 

Choosing the correct neighborhood is crucial for the sue- 



Algorithm 3 Neighborhood size selection for HLM by ran- 

domized local best fit flats 

Input: X = {xi, X2, • • • , x„} C R . data, x: a point in 

X, S: start size, T: step size, £, m (optional): mean shifts 

parameters. 
Output: VV(x): a neighborhood of x. 

Steps: 

• (Optional) Update the point x as the center of its £- 
nearest neighborhood in X, while repeating m times 

• k = -1 
repeat 

• k:=k+l 

• Set A4 to be the S + kT nearest points in X to x 

• Set L k to be the best fit flat to J\f k 

• Compute (3 2 (k) '■= /^(-A/fc) according to (fTJ 
until k > 1 and p 2 (k - 1) < min{/3 2 (fc - 2),/3 2 (k)} 

• Output 7V(x) := A4-1 



cess of the method. If the neighborhood is too small, even 
if the point is in a good affine cluster, then a small amount 
of noise in the data will result in a flat which does not match 
most of the points in the affine cluster. If the neighborhood 
is too large, it will contain points from more than one affine 
cluster, and the resulting best fit flat will again not match 
any of the actual data points. While it is possible to take 
a guess at the correct scale as a parameter, we have found 
that it is possible to choose the correct scale reasonably well 
automatically. 

What we will do is start at the smallest scale (say d + 1) 
and look at larger and larger neighborhoods of a given point 
Xo. At the smallest scale, any noise causes the local neigh- 
borhood to look D dimensional. As we add points to the 
neighborhood, it becomes better and better approximated 
in an average sense by its best fit flat, until points belong- 
ing to other flats enter the neighborhood. We thus take the 
neighborhood which is the first local minimum of the scaled 
least squares error for d-flat approximation. In practice, for 
a neighborhood J\f of xo the scaled least squares error for 
(i-flat approximation, f3 2 (J\f), is computed by the formula: 
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(1) 



where Pl denotes the projection onto the flat L. This no- 
tion of scaled error introduced and utilized in lfl5l[T6l[T7ll . 
and considered recently in lTT8l [191 for dimension estima- 
tion. The procedure we have just described is summarized 
in Algorithm[3] 

The following theorem tries to justify our strategy of fit- 
ting the correct scale around each point. We work with a 
"geometric" set of assumptions in the continuous setting, 
where our data set will be presumed to be a collection of 



tubes around flats. This corresponds roughly to a proba- 
bilistic setting of sampling according to mixtures of uniform 
distributions around subsets of d-flats. For convenience we 
assume infinite tubes but restrict to local scales. 

The analog of the discrete f3 2 introduced earlier when 
having an underlying continuous set il (here it is the union 
of tubes) in a ball of center x and radius r is defined as 
follows: 



2 (x, r) = mm 



dx 



/dist(x, L) 

\ 2~r J vol(fi n B(x, r)) 



SlnB(x,r) 

where the minimum is over all d-flats L (see also ifTTl ). 

Theorem 2.1. Let K > 2, d < D , L ir i = 1, . . . , K, be 

K d-flats in M. D , and Hi := T(Li, Wi) be K tubes in R D 
around these flats of comparable widths {wi}fL 1 . 
For fixed 1 < i* < K and fixed x 6 Lj», let 



and 



y = y(x) = argmin ye[Arv dist(y, x) (2) 



r :=dist(y,x). (3) 



Assume that r$ > Wi*. Then the function ^(x, r) is con- 
stant for r in [0,Wi*], comparable to a function which is 
decreasing for a sufficiently large subinterval of [wi* , 7"o], 
and satisfies the inequality 



&((l + e)T )&#j(ro) 



(4) 



for sufficiently small e, i.e., it has an "approximate" local 
minimum in the interval [ro, (1 + e) • r*o]. If d < 4, then 
e « Wi'/ro, and if d > 4 then e ~ [wi*/ro] . As 
Wi* /ro approaches zero, all comparability constants men- 
tioned above approach one. 

We remark that by imposing an upper bound on the 
widths of the tubes in the theorem above and a lower bound 
on the dihedral angles between the flats, then the local 
condition ro > Wi* (required by the theorem) is satis- 
fied at any point x which has distance larger than order of 
maxi<j<i<- Wi from the intersection of all flats. 

2.2. Some technical notes about the proposed algo- 
rithm 

Note that the first minimum in the Theorem excludes the 
left endpoint. In our experiments, we noticed that on data 
without too much noise, it is useful to allow the first scale 
to count as a local minimum. In the experiments below, we 
will show the results of the algorithm with both notions of 
"first" local minimum. 

The second technical detail concerns the choice of the 
random points used for candidate generation. We use the 



mean shift technique: given a point x, update x as the cen- 
ter of its neighborhood several times. The method shifts the 
point to a denser region, resulting in a more accurate estima- 
tion of the flats. In the experiments below, we will show the 
results with and without mean shift biased seed selection. 

3. Experimental results 

In this section, we conduct experiments on artificial and 
real data sets to verify the effectiveness of the proposed al- 
gorithm in comparison to other hybrid linear modeling al- 
gorithms. 

We measure the accuracy of those algorithms by the rate 
of misclassified points with outliers excluded, that is 

_. # of misclassified inliers _. 

error% = - . x 100% . (5) 

# of total inliers 

In all the experiments below, the number C in Algo- 
rithm Q] is 70 times the number of subspaces, the number 
p in Algorithm |2] is 3 times the number of subspaces, and 
the number T in Algorithm [3] is 2. We run experiments 
with and without mean shifts; the experiments using mean 
shifts use 10-nearest neighbors and 5 shifts. According to 
our experience the LBF algorithm is very robust to changes 
in parameters, but unsurprisingly, there is a general trade off 
between accuracy (higher C, higher p, smaller T), and run 
time (lower C, lower p, larger T). We have chosen these 
parameters for a balance between run time and accuracy. 

3.1. Simulated data 

We compare our algorithm with the following al- 
gorithms: Mixtures of PPCA (MoPPCA) |0), if -flats 
(KF) 0, Local Subspace Analysis (LSA) fl3], Spec- 
tral Curvature Clustering (SCC) 0~2], Random Sam- 
ple Consensus (RANSAC) |22) and GPCA with vot- 
ing (GPCA) [2|. We use the Matlab codes of the 
GPCA, MoPPCA and KF algorithm from http://percep 
tion.csl.uiuc.edu/gpca, the SCC algorithm from http://www 
.math.umn.edu/~lerman/scc and the LSA, RANSAC algo- 
rithms from http://www.vision.jhu.edu/db. 

The MoPPCA algorithm is always initialized with a ran- 
dom guess of the membership of the data points. The 
LSCC algorithm is initialized by randomly picking 100 x K 
(d + l)-tuples (following lTT2l ). and KF are initialized with 
random guess. Since algorithms like KF tend to converge to 



The RANSAC code we use (and most standard versions of RANSAC) 
depend on a user supplied inlier threshold. The first part of our algorithm 
can in some sense be considered to be the automatic detection of this in- 
lier threshold; and if this is provided by the user, the initialization we have 
described is no longer useful, as we would simply pick the largest neigh- 
borhood so that the distance from any point to its projection is smaller than 
the user supplied bound. The experiments in the table use the oracle choice 
of inlier bound (given by the true noise variance), and so here RANSAC 
has an advantage over the other algorithms listed. 



local minimum, we use 10 restarts for MoPPCA, 30 restarts 
for KF, and recorded the misclassification rate of the one 
with the smallest £ 2 error for MoPPCA as well as KF. The 
number of restarts was restricted by the running time and 
accuracy. RANSAC uses the oracle inlier bound given by 
the model's noise variance. 

The simulated data represents various instances of K 
linear subspaces in MP. If their dimensions are fixed and 
equal d, we follow [12] and refer to the setting as d K € 
M D . If they are mixed, then we follow J2J and refer to 
the setting as (d\, . . . , dx) G ^P ■ Fixing K and d (or 
d%, . . . , dj<), we randomly generate 100 different instances 
of corresponding hybrid linear models according to the code 
in http://perception.csl.uiuc.edu/gpca. More precisely, for 
each of the 100 experiments, K linear subspaces of the 
corresponding dimensions in M D are randomly generated. 
Within each subspace, the underlying sampling distribution 
is the sum of a uniform distribution in a d-dimensional ball 
of radius 1 of that subspace (centered at the origin for the 
case of linear subspaces) and a Z?-dimensional multivari- 
ate normal distribution with mean and covariance matrix 
0.05 2 • Idxd- Then, for each subspace 250 samples are 
generated according to the distribution just described. Next, 
the data is further corrupted with 5% or 30% uniformly dis- 
tributed outliers in a cube of sidelength determined by the 
maximal distance of the former 250 samples to the origin 
(using the same code). 

Since most algorithms (including ours) do not support 
mixed dimensions natively, we assume each subspace has 
the maximum dimension in the experiment. 

The mean (over 100 instances) misclassification rate of 
the various algorithms is recorded in Table Q] The mean 
running time is shown in Table [2] In each of the Tables, 
our algorithm is labeled LBF (Local Best-fit Flats); our al- 
gorithm with mean shifts and using the modified choice 
of good neighborhood described in section 12.21 is labeled 
LBFMS. 

3.2. Motion segmentation data 

We test the proposed algorithm on the Hopkins 155 
database of motion segmentation, which is available at 
http://www.vision.jhu.edu/data/hopkinsl55. This data con- 
tains 155 video sequences along with the coordinates of cer- 
tain features extracted and tracked for each sequence in all 
its frames. The main task is to cluster the feature vectors 
(across all frames) according to the different moving ob- 
jects and background in each video. 

More formally, for a given video sequence, we denote the 
number of frames by F. In each sequence, we have either 
one or two independently moving objects, and the back- 
ground can also move due to the motion of the camera. We 
let K be the number of moving objects plus the background, 
so that if is 2 or 3 (and distinguish accordingly between 



Table 1. Mean percentage of misclassified points in simulation for linear-subspace cases or affine-subspace case. The proposed algorithm 
as in Section |2"2l is in the row labeled LBFMS, and the "vanilla" version is in the row labeled by LBF 
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Table 2. Mean running time for linear-subspaces cases and affine-subspaces cases. The proposed algorithm as in Section |2T2l is in the row 
labeled LBFMS, and the "vanilla" version is in the row labeled by LBF. 
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two-motions and three-motions). For each sequence, there flats. 



are also N feature points y i , y2 , • • • , yjv £ R that are de- 
tected on the objects and the background. Let Zy £ R 2 
be the coordinates of the feature point y^ in the i th im- 
age frame for every 1 < i < F and 1 < j < N. Then 



,-ih 



5lj,Z2j 



,zfj\ £ 



p2F 



is the trajectory of the 



j"'° feature point across the F frames. The actual task of 
motion segmentation is to separate these trajectory vectors 
zi, Z2, • ■ • , zn into K clusters representing the K underly- 
ing motions. 

It has been shown 1 8 1 that under affine camera models 
and with some mild conditions, the trajectory vectors cor- 
responding to different moving objects and the background 
across the F image frames live in distinct affine subspaces 
of dimension at most three in M. 2F . Following this theory, 
we implement our algorithm with d = 3, and use affine 



We compare our algorithm with the following: im- 
proved GPCA for motion segmentation (GPCA) |J23l, K- 
flats (KF) [7 1 (implemented for linear subspaces), Local 
Linear Manifold Clustering (LLMC) |[T3l . Local Subspace 
Analysis (LSA) fl~Q, Multi Stage Learning (MSL) ||24l . 
Spectral Curvature Clustering (SCC) [12], Sparse Subspace 
Clustering (SSC) 04], and Random Sample Consensus 
(RANSAC) (22] |25] |26]. As before, our algorithm is la- 
beled LBF (Local Best-fit Flats); our algorithm with mean 
shifts and using the modified choice of good neighborhood 
described in section l2~2l is labeled LBFMS. 



For these algorithms, we copy the results from 
http://www.vision.jhu.edu/data/hopkinsl55 (they are based 
on experiments reported in J26l and lfl3l ) and 11271 . and we 
just record the mean misclassification rate and the median 



Table 3. The mean and median percentage of misclassified points for two-motions and three-motions in Hopkins 155 database, 
proposed algorithm as in Section |Z2l is in the row labeled LBFMS, and the "vanilla" version is in the row labeled by LBF 
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0.00 
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0.84 


0.00 
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2.35 
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Figure 2. Using our neighborhood choice to improve initialization of k-flats: the vertical axis is accuracy, and the horizontal axis is fixed 
neighborhood size in geometric farthest insertion for initialization of K flats. The red line is the result of using adapted neighborhoods. 
The data sets are #1,#2, and #3 as described in Section \3~4\ Random initialization leads to errors of .4 or greater for all three data sets. 



misclassification rate for each algorithm for any fixed K 
(two or three-motions) and for the different type of motions 
("checker", "traffic" and "articulated"). 

3.3. Discussion of Results 

From Table Q] we can see that our algorithm performs 
well in various artificial instances of hybrid linear modeling 
(with both linear subspace and affine subspace), and its ad- 
vantage is especially obvious with many outliers and affine 
subspaces. The robustness to outliers is a result of our use 
of the t\ error as loss function, and because of the random 
sampling. Also unlike many other methods, the proposed 
method natively supports affine subspace models. 

Table|2]shows that the running time of the proposed algo- 
rithm is less than the running time of most other algorithms, 
especially GPCA, LSA and LSCC. The difference is large 
enough that we can also use the proposed algorithm as an 
initialization for the others. The algorithm is slower than a 



single run of if-fiats, but it usually takes many restarts of 
if-flats to get a decent result. Notice that the choice of C 
and p in our algorithm function in a similar manner to the 
number of restarts in KF 



From Table [3] we can see that the local best-fit flat algo- 
rithm works well for the data set. Of all the methods tested, 
only SCC and SSC had better accuracy. However LBF ran 
4 times faster than SCC and more than 100 times faster than 
SSC. In many of the cases where SSC performed better than 
LBF, the £i energy (as well as the £2 energy) was lower for 
the labels obtained by LBF than the labels obtained by SSC. 
We thus suspect that good clustering of the Hopkins data re- 
quires additional type of clustering (e.g., bottleneck cluster- 
ing) to be combined with subspace clustering (i.e., hybrid 
linear modeling). 



Table 4. The percentage of incorrectness (e%) and the average computation time t of the three methods SOD (LBF), ALC and GPCA. 
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Figure 1 . £>afa se? #5 /rom Section \3.4\ The color value repre- 
sents the number of neighbors chosen at that point. Note that the 
algorithm chooses smaller neighborhoods for points closer to the 
intersection of the planes. 



3.4. Initializing X-flats with good neighborhoods 

Here we demonstrate that our choice of neighborhoods 
can be used to get a more robust initialization of X-flats. 
We work with geometric farthest insertion. For fixed neigh- 
borhood sizes, say of m neighbors, this goes as follows: we 
pick a random point xo and then find the best fit flat Fq for 
the m point neighborhood of Xrj. Then we find the point xi 
in our data farthest from Fq, find the best fit flat F\ of the m 
neighborhood of xi, and then choose the point X2 farthest 
from Fo and F\ to continue. We stop when we have K flats; 
we use these as an initialization for X-flats. 

We work on three data sets. Data set #1 consists of 1500 
points on three parallel 2-planes in K 3 . 500 points are drawn 
from the unit square in x, y plane, and then 500 more from 
lhex,y, z + . 2 plane, and then 500 more from the x, y, z + A 
plane. This data set is designed to favor the use of small 
neighborhoods. The next data set is three random affine 
sets with 15% Gaussian noise and 5% outliers, generated 
using the Matlab code from GPCA, as in Section |3~T1 This 
data set is designed to favor large neighborhood choices. Fi- 
nally, we work on a data set with 1500 points sampled from 
3 planes in K 2 as in FigureQ] The error rates of if-flats with 
farthest insertion initialization with fixed neighborhoods of 
size 10, 20, ..., 160 are plotted against the error rate for far- 
thest insertion with adapted neighborhoods (searched over 
the same range), averaged over 400 runs in Figure [2] Al- 



though our method did not always beat the best fixed neigh- 
borhood, it was quite close; and it always significantly bet- 
ter than the wrong fixed neighborhood size. Both methods 
did significantly better than a random initialization. 

In Figure Q] we plot the number of neighbors picked by 
our algorithm for each point of a realization of data set #3. 

3.5. Automatic determination of the number of 
affine sets 

In this section we show experimentally that using the el- 
bow method on the least squares errors of the outputs of the 
randomized best fit flat method can accurately determine the 
number of affine clusters. 

Let Wk be the total mean squared distance of a data set 
to the flats returned by our algorithm with k affine clus- 
ters specified; as k increases, Wk decreases. A classical 
method for determining the correct number k is to find the 
"elbow", or the k past which adding more clusters does not 
significantly decrease the error. We use the Second Order 
Difference (SOD) formulation of this heuristic Il28ll : 

S , OX>(lnW fc )=lnW fc _i+]nWfc + i-21nW r fcl (6) 

Then the optimal k is found by: 

k op t — argmaxS'0-D(lnW / fc). (7) 

k 

We compare SOD (LBF), i.e., SOD applying LBF, with 
ALC J29l and GPCA O on a number of artificial data 
sets. Similarly to Section 13.11 data sets were generated 
by the Matlab code borrowed from the GPCA package in 
http://perception.csl.uiuc.edu/gpcawith 100d samples from 
each subspace and 0.05 Gaussian noise. For the last four 
experiments, we restrict the angle between subspaces to be 
at least ir/8 for separation. All algorithms are given the di- 
mension d and we choose k max = 10 in SOD (LBF). For 
ALC, we use the oracle choice of the parameter e, setting it 
equal the true noise level. For GPCA, we embed the data to 
a d+1 subspace by PCA and let the tolerance of rank detec- 
tion be 0.05 0]|2). There is no automatic way to choose this 
tolerance, so we tried different values and picked the one 
which matched the ground truth the best. Each experiment 
is repeated 100 times and the error (e%) and the average 
computation time t (in seconds) are recorded in Table [4] 



4. Conclusions and future work 

We presented a very simple geometric method for hy- 
brid linear modeling based on selecting a set of local best 
fit flats that minimize a global t\ error measure. The size of 
the local neighborhoods is determined automatically using 
the I2 f3 numbers; it is proven under certain geometric con- 
ditions that good local neighborhoods exist and are found 
by this method. We give extensive experimental evidence 
demonstrating the state of the art accuracy and speed of the 
algorithm on synthetic and real hybrid linear data. 

We believe that the next step is to adapt the method for 
multi-manifold clustering. As it is, our method, while quite 
good at unions of affine sets, cannot successfully handle 
unions of curved manifolds. We believe that by gluing to- 
gether groups of local best fit flats related by some smooth- 
ness conditions, we will be able to approach the problem of 
clustering data which lies on unions of smooth manifolds. 
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